
CMSC 320 Final Project – US Pollution

Contributors: Michael Wang, Trueman Phamdo, Jason Lee

Introduction

Air pollution basics

According to the World Health Organization, air pollution kills an estimated 7 million people each year. This figure is the combined effect of ambient (outdoor) pollution and household air pollution. Ambient air pollution contributes an estimated 4.2 million deaths per year through stroke, heart disease, lung cancer, and chronic respiratory diseases, whilst household air pollution contributes an estimated 3.8 million deaths per year due to exposure to smoke from cooking fires. In this exploration, we will focus primarily on ambient air pollution. For more information, you can visit this WHO page.

WHO data finds that 9 out of every 10 people breathe air containing a high level of pollutants. Pollutants, simply put, are harmful, unwanted substances in the air that reduce air quality. In this tutorial, we will explore 4 of the 5 main types of pollutants present throughout the United States ($PM$ is not in the database):

  • Sulphur dioxide ($SO_2$) : This contaminant is mainly emitted during the combustion of fossil fuels such as crude oil and coal.
  • Carbon monoxide ($CO$) : This gas forms during the incomplete combustion of fuels, for example when a car engine runs in a closed room.
  • Nitrogen dioxide ($NO_2$) : These contaminants are emitted by traffic, combustion installations, and industry.
  • Ozone ($O_3$) : Ozone is created through the influence of ultraviolet (UV) sunlight on pollutants in the outside air.
  • Particulate Matter ( $PM$ ) : Particulate matter is the sum of all solid and liquid particles suspended in air. This complex mixture includes both organic and inorganic particles, such as dust, pollen, soot, smoke, and liquid droplets. These particles vary greatly in size, composition, and origin.

Note: Predictions might have been more accurate if the dataset also contained particle pollution (particulate matter, $PM$) levels, since $PM$ causes similarly serious health issues.

Given how dangerous pollutants can be, we want to explore any trends regarding the presence of these pollutants in our air. We hope we can gain better insight into how air quality has changed over time and maybe find factors that contribute to higher pollutant levels.

Necessary libraries

Before we do anything, we must import the following libraries:

In [1]:
## import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from pandas.plotting import register_matplotlib_converters
import plotly.offline as pyo
import plotly.graph_objs as go
import plotly.express as px
import statsmodels.formula.api as sm

Data preparation

About this dataset

Now that we have all the libraries we need, we can get our dataset. We plan on using a dataset scraped from the U.S. EPA database. It contains data on four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) for every day from 2000-2016.

Loading the Dataset

First, we must load in our dataset. We are going to load in our dataset from a CSV file locally stored at file path '../input'. You can find and download the dataset on Kaggle under US Pollution.

In [2]:
## read csv
df = pd.read_csv("../input/pollution_us_2000_2016.csv")
print('** successfully imported csv as dataframe **')
** successfully imported csv as dataframe **

Tidying

Oftentimes, we are presented with messy data, and we must clean the data so it becomes easier to work with. We also want to tidy the data. Tidy data is a standard way of organizing a multivariate data set so that each attribute is a column and each entity is a row. Tidy data is much easier to work with!

In this dataset, there are a large number of columns that either provide no valuable information, repeat information present in other columns, or simply go unused. We delete such columns.

  • Index column (Unnamed: 0) serves no purpose.
  • State Code, County Code, Site Num, and Address are removed since the State, County, and City columns are sufficient.
  • All the units columns (NO2 Units, etc.) are removed since they only have 1 unique value representing their unit.
  • All 1st Max Value and 1st Max Hour columns are removed since they will not be used.
In [3]:
# drop unused columns
df = df.drop(['Unnamed: 0','State Code','County Code','Site Num','Address','NO2 Units','O3 Units','SO2 Units',
              'CO Units','NO2 1st Max Value', 'NO2 1st Max Hour','O3 1st Max Value', 'O3 1st Max Hour',
              'SO2 1st Max Value', 'SO2 1st Max Hour','CO 1st Max Value', 'CO 1st Max Hour'],axis=1)
df.head()
Out[3]:
State County City Date Local NO2 Mean NO2 AQI O3 Mean O3 AQI SO2 Mean SO2 AQI CO Mean CO AQI
0 Arizona Maricopa Phoenix 2000-01-01 19.041667 46 0.022500 34 3.000000 13.0 1.145833 NaN
1 Arizona Maricopa Phoenix 2000-01-01 19.041667 46 0.022500 34 3.000000 13.0 0.878947 25.0
2 Arizona Maricopa Phoenix 2000-01-01 19.041667 46 0.022500 34 2.975000 NaN 1.145833 NaN
3 Arizona Maricopa Phoenix 2000-01-01 19.041667 46 0.022500 34 2.975000 NaN 0.878947 25.0
4 Arizona Maricopa Phoenix 2000-01-02 22.958333 34 0.013375 27 1.958333 4.0 0.850000 NaN

As seen above, some entries are duplicates with the same observation date (Date Local column). Since there is no specific explanation for these duplicates, we will calculate the mean values for each date and city.

Also, there is a lot of missing data, represented as NaN. We delete these rows (we can afford this here because there are many duplicate data points that make up for the rows we lose).

We also delete entries with 'Country Of Mexico' as their state, since we are only dealing with pollution in the U.S.

Lastly, we change Date Local from a string to a datetime.

In [4]:
# replace duplications with a single entry of mean values
df = df.groupby(['State','County','City','Date Local']).mean().reset_index()

# remove entries with NA
df = df.dropna(axis='rows')

# remove Mexico
df = df[df.State!='Country Of Mexico']

# Change date from string to date value
df['Date Local'] = pd.to_datetime(df['Date Local'], format='%Y-%m-%d')

The air quality level of each category ($NO_{2}$, etc.) is defined by the Air Quality Index (AQI). More detailed information on the Index can be found in the EPA Air Quality Basic Page.

According to the Air Quality Index guide provided by the EPA (page 3), the highest of the AQI values across the individual categories is reported as the overall AQI value.

We create a new column to record the overall AQI.

In [5]:
# overall AQI
df['AQI'] = df[['NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']].max(axis=1)

Exploratory data analysis

Looking at Maryland

We want to explore our data! There is plenty of insight to be found in the dataset, and using plots and other visualization tools helps make these insights clear. As an example, we look at the Air Quality Index of Maryland across time. First, we must get the data we want using various pandas methods. Here, we use the groupby and mean methods to get the monthly average AQI for Maryland.

In [6]:
# looking at Maryland
maryland = df[df['State'] == 'Maryland'] 

# Get monthly mean AQI values
maryland = maryland.groupby([maryland['Date Local'].dt.year.values, maryland['Date Local'].dt.month.values]).mean()
maryland = maryland.reset_index()

# cleaning datatable by clarifying column names
maryland['Date Local'] = pd.to_datetime(dict(year=maryland['level_0'], month=maryland['level_1'], day=1))
maryland = maryland.drop(['level_0', 'level_1'], axis=1)

We want to plot the data, so we use the matplotlib library and its subplots() method to look at individual graphs for $NO_2$, $O_3$, $SO_2$, $CO$, and overall AQI next to each other.

In [7]:
# Plotting
register_matplotlib_converters() #helps with date-time conversion from pandas
plt.rcParams['figure.figsize'] = [16, 8] #adjust graph width and length

# now we plot NO2 AQI, O3 AQI, SO2 AQI, CO AQI, and overall AQI
fig, axs = plt.subplots(2, 3)

axs[0,0].plot('Date Local', 'NO2 AQI', data=maryland) # NO2
axs[0,0].set_title('NO2 AQI')

axs[0,1].plot('Date Local', 'O3 AQI', data=maryland) # O3
axs[0,1].set_title('O3 AQI')

axs[1,0].plot('Date Local', 'SO2 AQI', data=maryland) # SO2
axs[1,0].set_title('SO2 AQI')

axs[1,1].plot('Date Local', 'CO AQI', data=maryland) # CO
axs[1,1].set_title('CO AQI')

axs[0,2].plot('Date Local', 'AQI', data=maryland) # AQI
axs[0,2].set_title('AQI')

fig.delaxes(axs[1, 2]) # remove unused plot
fig.suptitle("Maryland Air Quality Metrics Across Time") #set title
Out[7]:
Text(0.5, 0.98, 'Maryland Air Quality Metrics Across Time')

We initially see here that $SO_{2}$ AQI and $NO_{2}$ AQI have generally decreased in Maryland, while $O_{3}$ AQI, $CO$ AQI, and the overall AQI have been more level. However, we notice something more interesting. Each AQI metric appears to follow a sinusoidal pattern, suggesting that there may be a seasonal element to AQI in Maryland. We want to explore this further.

Exploring seasonality of AQI

We will find the average overall AQI for each month of the year across our entire dataset to see if this pattern persists throughout the country. We will then plot a bar graph of these means. We first need to extract the month from each data point in our dataset.

In [8]:
# We first capture the month of each entity

# mapping from month number to month name
month_num_to_name = {
    1 : 'January',
    2 : 'February',
    3 : 'March',
    4 : 'April',
    5 : 'May',
    6 : 'June',
    7 : 'July',
    8 : 'August',
    9 : 'September',
    10 : 'October',
    11 : 'November',
    12 : 'December'
}

# capture month number from original dataframe and apply above mapping
df_month = df.copy()
df_month['month'] = df['Date Local'].dt.month
df_month['month'] = df_month['month'].map(month_num_to_name)

# set order to categorical month value
df_month.month = pd.Categorical(df_month.month, 
                      categories=['January','February','March','April','May','June','July',
                                  'August','September','October','November','December'],
                      ordered=True)

Now that we have extracted the months, we can calculate the mean AQI for each month. We then plot these results.

In [9]:
# plotting

# get each months mean
df_month_mean = df_month.groupby(['month']).mean()
df_month_mean = df_month_mean.reset_index()

# plot 
df_month_mean.plot.bar(x='month', y='AQI', title="Mean Overall AQI by Month")
plt.ylabel('AQI')
plt.show()

Sure enough, this seasonal hypothesis seems to persist throughout the dataset. The summer months all appear to have relatively poorer air quality, whilst the winter months have much better air quality. To gauge how significant this finding is, we are going to use the statsmodels.formula.api library to fit a linear regression model with a month term.

In [10]:
res = sm.ols('AQI~month', data=df_month).fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    AQI   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.075
Method:                 Least Squares   F-statistic:                     2862.
Date:                Mon, 18 May 2020   Prob (F-statistic):               0.00
Time:                        20:23:29   Log-Likelihood:            -1.6876e+06
No. Observations:              389778   AIC:                         3.375e+06
Df Residuals:                  389766   BIC:                         3.375e+06
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             34.3800      0.108    318.123      0.000      34.168      34.592
month[T.February]      1.8642      0.156     11.922      0.000       1.558       2.171
month[T.March]         4.5468      0.151     30.122      0.000       4.251       4.843
month[T.April]         8.1952      0.145     56.494      0.000       7.911       8.480
month[T.May]          10.4817      0.145     72.450      0.000      10.198      10.765
month[T.June]         13.0961      0.146     89.752      0.000      12.810      13.382
month[T.July]         12.6382      0.145     87.059      0.000      12.354      12.923
month[T.August]       12.2110      0.145     84.100      0.000      11.926      12.496
month[T.September]     6.9447      0.146     47.536      0.000       6.658       7.231
month[T.October]       1.3335      0.146      9.133      0.000       1.047       1.620
month[T.November]     -0.1799      0.156     -1.155      0.248      -0.485       0.125
month[T.December]     -1.6850      0.154    -10.916      0.000      -1.988      -1.383
==============================================================================
Omnibus:                   203688.855   Durbin-Watson:                   0.760
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1783318.827
Skew:                           2.376   Prob(JB):                         0.00
Kurtosis:                      12.340   Cond. No.                         13.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting this statsmodels summary is a little difficult, but we will walk through some results to clarify. For more information, see the statsmodels website.

The Intercept here represents the month of January. The coefficient means that the average AQI in January is 34.38, with a standard error of .108. Looking at the P>|t| column, we see that the p-value < .05, so we can reject the null hypothesis of no relationship. One row below, February has a coefficient of 1.86, meaning the AQI is on average 1.86 greater in February than in January. Likewise, the AQI is on average 13.10 greater in June than in January. The p-value for every month except November is less than .05, meaning those parameters are significantly different from zero, and we can reject the null hypothesis of no significance for them.
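As a quick sanity check on reading these coefficients (hand-copied from the summary above), the model's predicted mean AQI for any month is just the January intercept plus that month's coefficient:

```python
# Hand-copied coefficients from the OLS summary above
intercept = 34.3800                      # January baseline AQI
month_coef = {"February": 1.8642, "June": 13.0961, "December": -1.6850}

# Predicted mean AQI per month = intercept + month coefficient
predicted = {m: round(intercept + c, 2) for m, c in month_coef.items()}
print(predicted)   # June works out to 34.38 + 13.10 ≈ 47.48
```

So the fitted model predicts a June mean AQI of roughly 47.5, versus about 32.7 in December, matching the summer peak in the bar chart.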

We find this result quite interesting, and we had no idea that we would find it. This highlights how you can uncover new insights by exploring the data more closely!

AQI Over Time for U.S.

Now we move on to something different. Earlier we plotted Maryland's AQI over time. However, we are also interested in how AQI has changed over time for the rest of the United States.

We are first going to look at the 6 states with the highest (worst) mean AQI from 2000-2016. Let's get a list of those states, and then plot the yearly mean AQI for each, similarly to what we did with Maryland.

In [11]:
# top 6 worst states by mean AQI from 2000-2016
worst_states = df.groupby(['State']).mean().reset_index().sort_values(by='AQI', ascending=False).iloc[:6].State 
worst_states = worst_states.array
In [12]:
# plotting yearly mean of worst air quality states
df2 = df.copy()
df2 = df2.groupby([df2.State, df2['Date Local'].dt.year.values]).mean().reset_index()
df2['Date Local'] = pd.to_datetime(dict(year=df2['level_1'], month=1, day=1))
df2 = df2.drop(['level_1'], axis=1)

fig, axs = plt.subplots(2, 3)
for state in worst_states:
    
    axs[0,0].plot('Date Local', 'NO2 AQI', data=df2[df2.State == state], label=state)
    axs[0,0].set_title('NO2 AQI')

    axs[0,1].plot('Date Local', 'O3 AQI', data=df2[df2.State == state], label=state)
    axs[0,1].set_title('O3 AQI')

    axs[1,0].plot('Date Local', 'SO2 AQI', data=df2[df2.State == state], label=state)
    axs[1,0].set_title('SO2 AQI')

    axs[1,1].plot('Date Local', 'CO AQI', data=df2[df2.State == state], label=state)
    axs[1,1].set_title('CO AQI')

    axs[0,2].plot('Date Local', 'AQI', data=df2[df2.State == state], label=state)
    axs[0,2].set_title('AQI')
    
    axs[1,2].plot(label=state)
    
axs[1,0].legend(loc='best') #add legend

fig.delaxes(axs[1, 2]) # remove unused plot
fig.suptitle("Worst States Air Quality Metrics (Yearly Mean) Across Time") #set title
Out[12]:
Text(0.5, 0.98, 'Worst States Air Quality Metrics (Yearly Mean) Across Time')

These charts give us some insight. For example, we can see that for the worst states, $SO_{2}$ AQI and $CO$ AQI have generally gone down, while $NO_{2}$ AQI, $O_{3}$ AQI, and overall AQI have - for the most part - stayed the same.

The charts give us a good idea of how different states have performed in terms of AQI over time. However, we would like to see information for all states, yet plotting all states on one chart would be too cluttered. To resolve this issue, we use the plotly.express library to show all the states at once on a map. This gives us a different visualization tool that might give us some new insights.

We plan on using the plotly.express.choropleth method, which requires the U.S. states to be in their 2-letter abbreviation form. Thus we use a dictionary taken from this github page mapping U.S. state names to their 2-letter abbreviations. We load in the dictionary saved at file location "./dictionaries/us_state_abbrev_dict.npy" below, which gives us a state-name-to-abbreviation mapping, and then from that we create a mapping in the reverse direction.

In [13]:
# loads dictionary that maps from full U.S. state name to two-letter abbreviation
us_state_abbrev = np.load('dictionaries/us_state_abbrev_dict.npy', allow_pickle='TRUE').item()

# creates a dictionary that maps in the reverse direction
abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))

Now we want to get the data we are looking for. Specifically, we are going to look at the yearly average overall AQI for each state, so we can visualize states' air quality over time. We use various pandas methods below to do so.

In [14]:
# makes copy of original dataframe
df_states = df.copy()

# gets state yearly mean AQI
df_states = df_states.groupby(['State', df_states['Date Local'].dt.year.values]).mean().reset_index()

# rename column names
df_states = df_states.rename(columns={'level_1': 'year', 'AQI': 'Mean AQI'})

# get the used columns
df_states = df_states[['State','Mean AQI', 'year']]

# map the state names to the corresponding 2-letter abbreviation
df_states['State'] = df_states['State'].map(us_state_abbrev)

We also create a new column called 'AQI range' that categorizes the mean AQI into ranges. This will help us color-coordinate each state's AQI when we create our plotly map.

In [15]:
def AQI_range(mean_aqi):
    if mean_aqi < 30:
        return '< 30'
    elif mean_aqi < 35:
        return '30-35'
    elif mean_aqi < 40:
        return '35-40'
    elif mean_aqi < 45:
        return '40-45'
    else:
        return '> 45'
    
# apply above function to dataframe
df_states['AQI range'] = df_states['Mean AQI'].apply(AQI_range)
df_states.head()
Out[15]:
State Mean AQI year AQI range
0 AL 25.161290 2013 < 30
1 AL 38.062914 2014 35-40
2 AL 37.846667 2015 35-40
3 AL 39.724832 2016 35-40
4 AK 26.385475 2014 < 30

Now that we have the data we want, we use the plotly.express library and its choropleth() method to visualize the data. We can play the animation to see how different states' overall AQI has varied year to year.

In [16]:
# Set notebook mode to work in offline
pyo.init_notebook_mode()  

# set map settings
fig = px.choropleth(df_states, locations='State', locationmode="USA-states", 
                    color='AQI range', scope="usa",  
                    color_discrete_map={
                        '< 30': "grey", 
                        '30-35': "royalblue", 
                        '35-40': "purple", 
                        '40-45': "orange",
                        '> 45': "red" 
                    },
                    category_orders={"year": list(range(2000,2017))}, animation_frame="year",
                    title="Heat Map of States Mean AQI 2000-2016")

# show map
fig.show()

Unfortunately, as is common in many datasets, we are missing data for different states in different years, explaining why some states are greyed out. However, we can still see some cool trends. For example, it appears that Northern states like North Dakota, Idaho, Washington, and Maine have all generally had better AQIs, as they more typically fall in the < 30 AQI range. Moreover, the country as a whole seems to have an improving AQI over time. In the 2000-2005 range, it was commonplace to see mean AQIs > 45, yet from 2006-2016, no state in the dataset had a mean AQI > 40.

Exploring Decline in AQI Over Time

As we noted by looking at the above map, it appears that overall AQI has decreased over time in the U.S. We want to explore this further and see how significant this finding is. First we are going to plot the mean AQI in our dataset for each year 2000-2016.

In [17]:
df_year = df_month
df_year['year'] = df_year['Date Local'].dt.year.values
df_year_mean = df_year.groupby(['year']).mean().reset_index()

# plot 
df_year_mean.plot.bar(x='year', y='AQI', title="Mean Overall AQI by Year")
plt.ylabel('AQI')
plt.show()

There does appear to be a decline in overall AQI, especially in the years 2002-2009. As we did earlier with each month, let's test the significance of these results.

In [18]:
res = sm.ols('AQI~year', data=df_month).fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    AQI   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     7347.
Date:                Mon, 18 May 2020   Prob (F-statistic):               0.00
Time:                        20:24:19   Log-Likelihood:            -1.6991e+06
No. Observations:              389778   AIC:                         3.398e+06
Df Residuals:                  389776   BIC:                         3.398e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1147.3446     12.911     88.863      0.000    1122.039    1172.651
year          -0.5511      0.006    -85.712      0.000      -0.564      -0.538
==============================================================================
Omnibus:                   212220.690   Durbin-Watson:                   0.716
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1905499.114
Skew:                           2.497   Prob(JB):                         0.00
Kurtosis:                      12.612   Cond. No.                     8.55e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.55e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We see that on average the overall AQI has declined by .5511 each year from 2000-2016, and that this finding is significantly different from zero (p-value < .05). This is quite interesting. Before beginning this exploration, we thought that air quality would be worsening over time because of global warming, but the data suggests otherwise: air quality in the U.S. is actually improving. We have now found that both year and month have a significant relationship with overall AQI. As a point of inquiry, one may want to create a linear regression with both a month and a year term and see how well the model fits the data.
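That follow-up can be sketched as below. We fit the combined formula on synthetic data here rather than the full dataset; the data-generating coefficients are made up purely for illustration, and only the formula pattern carries over:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in for df_month: a monthly effect plus a slow yearly decline
rng = np.random.default_rng(0)
n = 5000
month = rng.integers(1, 13, n)
year = rng.integers(2000, 2017, n)
aqi = 34 + 1.2 * month - 0.55 * (year - 2000) + rng.normal(0, 3, n)
toy = pd.DataFrame({'AQI': aqi, 'month': month, 'year': year})

# Both terms in one model: categorical month dummies plus a linear year trend
res = sm.ols('AQI ~ C(month) + year', data=toy).fit()
print(res.params['year'])     # negative: the yearly decline survives the month terms
print(res.rsquared)
```

On the real df_month one would swap `toy` for the actual dataframe; the month column there is already categorical, so `C(...)` is optional.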

Cities AQI Over Time

We have now looked at how states' AQI has varied over time. Another thing we are interested in is how different cities' AQIs have varied over time. To do this, we want to get the longitude and latitude of each city, so we can map the locations onto a plotly.graph_objects.Scattergeo() map. This piece of code uses OpenCageData's free API, which lets us look up longitude and latitude from the location names we input. For example, this piece of code gets the latitude and longitude of College Park, Maryland.

(screenshot: geocoding College Park, Maryland with the OpenCage API)
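In case the screenshot does not render, here is a minimal sketch of the same lookup. The endpoint URL and response shape follow OpenCageData's documented geocoding API; `YOUR_API_KEY` is a placeholder, `build_geocode_url` and `parse_first_result` are helper names we introduce here (not part of the original code), no network request is actually sent, and the sample response is trimmed down with approximate coordinates:

```python
from urllib.parse import urlencode

# Build the OpenCage forward-geocoding request URL (nothing is sent here)
def build_geocode_url(place, api_key):
    params = urlencode({'q': place, 'key': api_key})
    return 'https://api.opencagedata.com/geocode/v1/json?' + params

# Pull (longitude, latitude) out of the first result of an OpenCage response
def parse_first_result(response_json):
    geometry = response_json['results'][0]['geometry']
    return (geometry['lng'], geometry['lat'])

url = build_geocode_url('College Park, Maryland', 'YOUR_API_KEY')

# A trimmed-down response of the shape the API returns
sample = {'results': [{'geometry': {'lng': -76.9378, 'lat': 38.9807}}]}
print(parse_first_result(sample))
```

In the real code, the JSON from an actual GET request to `url` would be fed to the parsing step instead of `sample`.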

We want to do this for all unique cities in our dataset, so we first grab all the unique cities.

In [19]:
# Gets List of unique cities in our dataset
df_unique_cities = df.copy()
df_unique_cities = df_unique_cities[['State', 'City']]
df_unique_cities = df_unique_cities.groupby(['State', 'City']).size().reset_index().rename(columns={0:'count'})
df_unique_cities['City'] = df_unique_cities.City + ', ' + df_unique_cities.State
df_unique_cities = df_unique_cities[['City']]
unique_cities = df_unique_cities.City.array

We then execute the following code to get the longitudes and latitudes of all distinct cities in our dataframe.

(screenshot: looping the geocoding request over every unique city)

Using the API takes a bit of time and we have a limited number of free requests, so we stored the dictionary produced by the above code in a file located at "./dictionaries/city_to_lon_lat_dict.npy" using the np.save() method. To run the code yourself, you can get a free API key from OpenCageData. We now read from the saved file.
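The save/load round trip itself is straightforward; here is a minimal sketch using a throwaway path and a toy dictionary (not the project's actual files):

```python
import os
import tempfile
import numpy as np

# A toy city -> (longitude, latitude) mapping standing in for the real one
city_to_lon_lat = {'Phoenix, Arizona': (-112.0740, 33.4484)}

# np.save wraps the dict in a 0-d object array; .item() unwraps it on load
path = os.path.join(tempfile.mkdtemp(), 'city_to_lon_lat_dict.npy')
np.save(path, city_to_lon_lat)
loaded = np.load(path, allow_pickle=True).item()
print(loaded == city_to_lon_lat)
```

The `allow_pickle=True` flag is required on load because the dictionary is stored via pickling, not as a numeric array.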

In [20]:
# read in dictionary mapping city names to longitude/latitude
city_to_lon_lat = np.load('dictionaries/city_to_lon_lat_dict.npy', allow_pickle='TRUE').item()

We now need to create a dataframe that has all the unique cities along with their corresponding longitudes and latitudes. We apply the above mapping to do so.

In [21]:
df_cities = df.copy() # copy original df

# gets yearly mean AQI for each city 
df_cities['City'] = df_cities['City'] + ', ' + df_cities['State']
df_cities = df_cities[['City', 'Date Local', 'AQI']]
df_cities = df_cities[df_cities['City'].isin(unique_cities)] 
df_cities = df_cities.groupby(['City', df_cities['Date Local'].dt.year]).mean().reset_index()
df_cities = df_cities.rename(columns={'Date Local': 'year', 'AQI' : 'Mean AQI'})

# applies the dictionary mapping to insert longitude and latitude into the dataframe for each city
df_cities['lon lat'] = df_cities.City.map(city_to_lon_lat)

# separate longitude and latitude into separate columns
df_cities[['lon','lat']] = pd.DataFrame(df_cities['lon lat'].tolist(), index=df_cities.index)

df_cities.head()
Out[21]:
City year Mean AQI lon lat lon lat
0 Albuquerque, New Mexico 2011 43.141243 (-106.6509851, 35.0841034) -106.650985 35.084103
1 Albuquerque, New Mexico 2012 43.523677 (-106.6509851, 35.0841034) -106.650985 35.084103
2 Albuquerque, New Mexico 2013 46.298883 (-106.6509851, 35.0841034) -106.650985 35.084103
3 Albuquerque, New Mexico 2014 43.584022 (-106.6509851, 35.0841034) -106.650985 35.084103
4 Albuquerque, New Mexico 2015 42.227273 (-106.6509851, 35.0841034) -106.650985 35.084103

Now that we have the data we want, we are going to use the plotly.graph_objects library and its Scattergeo() method to plot these cities and their AQI on a map. The Scattergeo method takes in a color argument, so we created our own discrete color scale below.

In [22]:
def get_color(aqi):
    if aqi < 30:
        return "grey" 
    elif aqi < 35:
        return "royalblue" 
    elif aqi < 40:
        return "purple"
    elif aqi < 45:
        return "orange"
    else:
        return "red"
                     
df_cities['color'] = df_cities['Mean AQI'].apply(get_color)

The Scattergeo() method also takes in a text argument which is displayed when you hover over the displayed point on the map. We create our text argument that includes City name and AQI.

In [23]:
df_cities['text'] = df_cities['City'] + ', AQI: ' + df_cities['Mean AQI'].apply(str)

The rest of the code required to make the map is quite complicated, so we will not go over each detail explicitly. You can read the in-line comments to get a better idea of what is going on or you can visit this plotly tutorial, which we used as inspiration.

In [24]:
# set year = 2000 frame
data=go.Scattergeo(
        lon = df_cities[df_cities.year == 2000]['lon'],
        lat = df_cities[df_cities.year == 2000]['lat'],
        text = df_cities[df_cities.year == 2000]['text'],
        mode = 'markers',
        marker = dict(
            size = df_cities[df_cities.year == 2000]['Mean AQI']**3/8000, #size of circle
            color = df_cities[df_cities.year == 2000]['color'],
            line_color='rgb(40,40,40)',
        ))

# creates a list of frames. These will be used to create a play/pause animation through the years
frames = [dict(data=[go.Scattergeo(
                    lon = df_cities[df_cities.year == year]['lon'],
                    lat = df_cities[df_cities.year == year]['lat'],
                    text = df_cities[df_cities.year == year]['text'],
                    mode = 'markers',
                    marker = dict(
                    size = df_cities[df_cities.year == year]['Mean AQI']**3/8000,
                    color = df_cities[df_cities.year == year]['color'],
                    line_color='rgb(40,40,40)'
                    )
                )],
                layout = go.Layout(title_text="Cities with Worst AQI " + str(year))
              ) for year in range(2001,2017)]


# sets the layout of the map
layout = go.Layout(
        title = go.layout.Title(
            text = 'Cities With Worst AQI 2000-2016'
        ),
        showlegend = False,
        geo = go.layout.Geo(
            scope = 'world',
            projection = go.layout.geo.Projection(
                type='albers usa'
            ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)",
            showsubunits = True,
            showcountries = True   
        ),
   
    )

# creates the play/pause buttons
layout.update(updatemenus=[ {
        "buttons": [
            # play button
            {
                "args": [None, {"frame": {"duration": 500, "redraw": True},
                                "fromcurrent": True, "transition": {"duration": 300,
                                                                    "easing": "quadratic-in-out"}}],
                "label": "Play",
                "method": "animate"
            },
            # pause button
            {
                "args": [[None], {"frame": {"duration": 0, "redraw": True},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
            }
        ],
        # formatting
        "direction": "left",
        "pad": {"r": 10, "t": 0},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
    }])


# finally, create the figure and plot
fig = go.Figure(data= data, layout=layout, frames=frames)
pyo.iplot(fig)

Unfortunately, not all cities appear consistently in the dataset, so it is hard to draw firm conclusions from this visualization. However, we can still see a great decrease in the number of cities with an AQI > 45 over time. This is consistent with our finding that the mean AQI in the U.S. has decreased. Another interesting finding is the large number of California cities with poor air quality (in all years except 2016, when many cities no longer appear in the dataset). This is probably due to the forest fires in California.


Conclusion

We broke down the data to decipher patterns. From this we can see that there is a correlation between time, specifically season of the year, and pollution level. The hotter months of spring and summer have the highest AQI levels on average across all the years from 2000-2016. There are still other questions to be explored: why is there a general trend of pollution AQI levels decreasing over time, as seen in our observations? Are events like summer vacation and spring break affecting emission levels due to increased travel? Or is it location, or simply population, since California seems to have high AQI levels based on the heat map? There are still many observations to be made with this dataset, possibly with the help of additional data sets like vacation and population data, but our analysis of the pollution data revealed some core insights, such as the effect of season on AQI levels and the steady decline of AQI across the U.S. over time.

This project has allowed us to put our data science knowledge to practical use, and was of great help in solidifying our foundations in data science and the computer science working environment.